1 Introduction

Patent Grant Introduction

A patent is a legal monopoly granted by a government in return for public disclosure of an invention. A granted patent gives the proprietor the right to prevent others from using the invention in the territory to which the patent applies. Patent analysis is a useful management tool for steering a company’s technology strategy and its product or service development process. Turning patent data into competitive intelligence allows a company to assess its current technical competitiveness, to forecast technological trends, and to plan for potential competition based on new technologies.

Objective

Patents affect many areas of business: strategic planning, mergers, acquisitions, licensing opportunities, R&D management, human resources, competitive intelligence, business intelligence, etc. The intent of this project is to explore the methods and techniques that are commonly used in patent analysis and to examine patent analysis as a tool for competitive intelligence.

Data Introduction

We use the counts of patent grants by state for 2011-2015. The dataset is taken from the U.S. Patent and Trademark Office, Patent Technology Monitoring Team (PTMT), and contains the following information:

-U.S. or Foreign

-Code

-State-Territory or Country

-First-Named Assignee

-Year(2011:2015)

Combined with company revenue figures, these fields are used to derive the growth rate of revenue, the growth rate of patent grants, and the growth rate of patent applications.


2 Problem Identification

Research questions:

What is the relationship between patent data and company revenue? How can we predict revenue by using patent data?


3 Approach Strategy

-Building a map, a pie chart, and a bar chart to find a suitable geographical region, data scope, and scale for this research

-Collecting raw data

-Performing a basic analysis and diagnosis of the chosen data with box plots and scatter plots

-Cleaning data, discarding NA, and omitting outliers

-Finding potential predictors by using a correlation matrix

-Building linear regression models or other fitted models

-Evaluating models with statistical criteria

-Testing the model with testing data (we collect data within 2011~2015 as training data, and use data in 2017 as testing data)


4 National Overview of Patent Data

The following shows how patent grants are distributed among the states.

An overview of the share of patents held by each state:

The pie chart shows that California ranks first, with by far the largest share.

Looking at the top 10 patent holders in CA:

Over time, the count of patent grants for each year is shown in a different colour.


5 Hypothesis

An enterprise’s revenue can be predicted by using patent data as an indicator of R&D strength.


6 Data Description and Diagnosis

6.1 Raw Data

Our patent data is collected from the USPTO, and the revenue information is retrieved from https://www.macrotrends.net/ . The raw data contains only the counts of patents and applications and the amount of revenue for each year from 2011 to 2015. After pre-processing, we obtain the Preliminary Data, comprising growth rates for 2012~2015, for the following analysis.
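
The pre-processing step can be sketched as follows. This is a minimal illustration, assuming the growth rate is the usual year-over-year relative change; the source does not state the exact formula. The counts below are hypothetical and are not taken from the dataset.

```r
# Hypothetical yearly patent counts for one company (illustrative values only)
counts <- data.frame(
  Year    = 2011:2015,
  patents = c(100, 120, 150, 135, 162)
)

# Year-over-year growth rate: (x_t - x_{t-1}) / x_{t-1}
# Assumed definition; the first year has no predecessor and becomes NA.
growth_rate <- function(x) {
  c(NA, diff(x) / head(x, -1))
}

counts$pat_rate  <- growth_rate(counts$patents)
counts$pat_total <- cumsum(counts$patents)  # running total accumulated from 2011

# Dropping the first year leaves the 2012-2015 rows used for analysis
prelim <- counts[!is.na(counts$pat_rate), ]
print(prelim)
```

Because a growth rate needs a preceding year, five years of raw counts give four years of rates, which matches the 2012~2015 range of the Preliminary Data.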

6.2 Preliminary Data

We omit the University of California from our preliminary data frame because it is not a private business entity and we could not find its revenue. Hence, 9 companies are left in our data set.

6.2.1 Nomenclature:

variable     denotation
rev_rate     the growth rate of revenue
pat_rate     the growth rate of granted patents
pat_total    the total of granted patents accumulated from 2011
app_rate     the growth rate of patent applications
app_total    the total of patent applications accumulated from 2011

6.2.2 Head of Preliminary data:

X  rev_rate    pat_rate    app_rate    app_total  pat_total  company   Year
1  0.2783981    0.4616541   0.1440729  3527       1637       QUALCOMM  2012
2  0.4458147    0.5350734   0.4978084  3989       1554       APPLE     2012
3  0.2145891    1.3588957   0.5192225  5832       1095       GOOGLE    2012
4  0.0119863    0.0100784  -0.0278970  1838       1795       BROADCOM  2012
5  0.0000000    0.0976059   0.1493931  2302       1139       HP        2012
6  0.0657828   -0.0438931   0.0244233  1492       1025       CISCO     2012
6 0.0657828 -0.0438931 0.0244233 1492 1025 CISCO 2012

6.2.3 Summary of Preliminary data:

X                 rev_rate          pat_rate          app_rate
Length:36         Min.   :-0.43622  Min.   :-0.21372  Min.   :-0.70880
Class :character  1st Qu.:-0.03815  1st Qu.:-0.03853  1st Qu.:-0.17151
Mode  :character  Median : 0.05763  Median : 0.09785  Median :-0.02499
NA                Mean   : 0.08289  Mean   : 0.15729  Mean   : 0.01498
NA                3rd Qu.: 0.14933  3rd Qu.: 0.23755  3rd Qu.: 0.14540
NA                Max.   : 0.69405  Max.   : 1.35890  Max.   : 0.88103

app_total       pat_total      company       Year
Min.   : 1492   Min.   : 634   APPLE   : 4   2012:9
1st Qu.: 3159   1st Qu.:1328   BROADCOM: 4   2013:9
Median : 4684   Median :1950   CISCO   : 4   2014:9
Mean   : 7333   Mean   :2483   GOOGLE  : 4   2015:9
3rd Qu.: 9756   3rd Qu.:3010   HP      : 4   NA
Max.   :32676   Max.   :7484   IBM     : 4   NA
NA              NA             (Other) :12   NA

6.2.4 Data Normalization

In order to further analyze this data, we need to normalize pat_total and app_total.
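
The source does not say which normalization was applied. As one plausible option, the sketch below uses min-max scaling to bring the two totals onto the same [0, 1] range as the rate variables. The helper name `min_max` is hypothetical; the input values are the six rows shown in the head of the preliminary data.

```r
# Min-max scaling to [0, 1]; an assumed method, not confirmed by the source
min_max <- function(x) {
  rng <- range(x, na.rm = TRUE)
  (x - rng[1]) / (rng[2] - rng[1])
}

# pat_total and app_total from the first six rows of the preliminary data
top6 <- data.frame(
  pat_total = c(1637, 1554, 1095, 1795, 1139, 1025),
  app_total = c(3527, 3989, 5832, 1838, 2302, 1492)
)

# Apply the same scaling to every column
top6_norm <- as.data.frame(lapply(top6, min_max))
print(round(top6_norm, 3))
```

Scaling matters here because the totals are in the thousands while the growth rates are mostly between -1 and 1; without it the box plots would not be readable on a shared axis.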

After processing:

library(plotly)  # plot_ly(), add_boxplot(), and the %>% pipe

# One box per variable of the normalized data frame
plot_ly(type = "box") %>%
  add_boxplot(y = top9_norm$rev_rate, name = "rev_rate") %>%
  add_boxplot(y = top9_norm$pat_rate, name = "pat_rate") %>%
  add_boxplot(y = top9_norm$pat_total, name = "pat_total") %>%
  add_boxplot(y = top9_norm$app_rate, name = "app_rate") %>%
  add_boxplot(y = top9_norm$app_total, name = "app_total")

6.3 Correlation Matrix

After processing the preliminary data, we draw a correlation plot to figure out the correlations among these variables. Because our goal is to build a model for predicting revenue growth, we focus on the middle row, which shows the correlation between rev_rate and the other variables. In this matrix, a larger circle means a stronger correlation. Hence, we pick pat_rate and pat_total as indicators to build our models.
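
The numbers behind such a plot can be sketched with base R's `cor()`. The data frame below holds only the six rows shown in the head of the preliminary data, so the coefficients it produces are illustrative and will differ from those of the full 36-row data set. The circle-style plot described above is the kind drawn by packages such as corrplot; which package was actually used is not stated in the source.

```r
# First six rows of the preliminary data (numeric columns only)
prelim_head <- data.frame(
  rev_rate  = c(0.2783981, 0.4458147, 0.2145891, 0.0119863, 0.0000000, 0.0657828),
  pat_rate  = c(0.4616541, 0.5350734, 1.3588957, 0.0100784, 0.0976059, -0.0438931),
  app_rate  = c(0.1440729, 0.4978084, 0.5192225, -0.0278970, 0.1493931, 0.0244233),
  app_total = c(3527, 3989, 5832, 1838, 2302, 1492),
  pat_total = c(1637, 1554, 1095, 1795, 1139, 1025)
)

# Pearson correlation matrix of all numeric variables
cor_mat <- cor(prelim_head)

# The row of interest: correlation of rev_rate with every other variable
print(round(cor_mat["rev_rate", ], 2))
```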

6.4 Scatter Plots

To learn more about these two variables, pat_rate and pat_total, we also draw scatter plots. The two plots show how each variable is distributed against rev_rate, and they reveal some outliers in both.

library(ggplot2)  # ggplot(), geom_point()
library(plotly)   # ggplotly() turns a ggplot into an interactive plot

s1 <- ggplot(data = top9_norm, aes(x = pat_rate, y = rev_rate)) + geom_point(aes(color = company))
ggplotly(s1)
s2 <- ggplot(data = top9_norm, aes(x = pat_total, y = rev_rate)) + geom_point(aes(color = company))
ggplotly(s2)

6.4.1 Omitting outliers

Before we start to build models, there is one more task: omitting outliers. The function below computes the limits we use to sieve out outliers, and the graph below demonstrates how the data changes afterwards.

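FindOutliers <- function(x){
  # Tukey's rule: values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] count as outliers
  q <- quantile(x)                       # q[2] = Q1 (25%), q[4] = Q3 (75%)
  iqr <- q[4] - q[2]                     # IQR = Q3 - Q1 (renamed to avoid masking stats::IQR)
  lower_limit <- q[2] - 1.5 * iqr        # Q1 - 1.5*IQR
  upper_limit <- q[4] + 1.5 * iqr        # Q3 + 1.5*IQR
  limits <- c(lower_limit, upper_limit)  # renamed to avoid masking base::list
  return(limits)
}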
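
The function above only returns the two limits; how they were applied to the data frame is not shown in the source. A minimal sketch of one way to do it, using hypothetical data and a restated helper (`find_limits`, a hypothetical name) so the example runs on its own:

```r
# Same 1.5*IQR rule as FindOutliers above, restated so the sketch is self-contained
find_limits <- function(x) {
  q <- quantile(x, c(0.25, 0.75), names = FALSE)
  fence <- 1.5 * (q[2] - q[1])
  c(lower = q[1] - fence, upper = q[2] + fence)
}

# Hypothetical data containing one extreme pat_rate value (1.36)
df <- data.frame(
  pat_rate = c(0.01, 0.10, -0.04, 0.46, 0.54, 0.05, 0.20, 1.36),
  rev_rate = c(0.01, 0.00, 0.07, 0.28, 0.45, 0.03, 0.10, 0.21)
)

# Keep only the rows whose pat_rate lies inside the limits
lim   <- find_limits(df$pat_rate)
clean <- df[df$pat_rate >= lim["lower"] & df$pat_rate <= lim["upper"], ]

print(lim)
print(nrow(df) - nrow(clean))  # number of rows dropped
```

Filtering whole rows rather than single values keeps each pat_rate paired with its rev_rate, which the regression in the next section requires.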


7 Modeling

7.1 Linear Regression Model~ pat_total

Now we have two candidates, pat_total and pat_rate. First, we use pat_total as a predictor to build a linear regression model for rev_rate.

7.1.1 Building LR model ~pat_total

lm_pat_total <- lm(rev_rate~pat_total, data=top9_cleanData2)

ggplot(top9_cleanData2, aes(pat_total, rev_rate)) +
  ggtitle("Linear Regression")+
  geom_point() +
  geom_smooth(method = "lm")

7.1.2 Model Evaluation

After building a model, we check the following criteria.

–Q1.Is the model statistically significant? (F statistics)

–Q2.Are the coefficients significant? (t statistics and p-values)

–Q3.Is the model useful? (R2 value)

–Q4.Does the model fit the data well? (residual plots)

–Q5.Does the data satisfy the assumptions?

Coefficients:

             Estimate   Std.Error  t value  Pr(>|t|)
(Intercept)  -0.000212  0.029644   -0.007   0.9943
pat_total     0.229157  0.115993    1.976   0.0585 .

Residual standard error: 0.1006 on 27 degrees of freedom

Multiple R-squared: 0.1263, Adjusted R-squared: 0.09394

F-statistic: 3.903 on 1 and 27 DF, p-value: 0.0585

par(mfrow=c(2,2))
plot(lm_pat_total)

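The statistical criteria and residual plots reveal that “pat_total” is not good enough to be a predictor: neither its coefficient nor the overall model is significant at the 5% level (p = 0.0585), and R-squared is only 0.1263. So next we turn to “pat_rate”.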
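
For a regression with a single predictor, the reported statistics are tied together by standard identities, which gives a quick consistency check on the output above. The sketch below recomputes the F statistic, the adjusted R-squared, and the p-value from the printed R-squared, t value, and degrees of freedom; small differences are due to rounding in the printed values.

```r
r2       <- 0.1263  # Multiple R-squared reported for the pat_total model
t_value  <- 1.976   # t value of the pat_total coefficient
df_resid <- 27      # residual degrees of freedom

# With one predictor: F = R^2 / (1 - R^2) * df_resid, and also F = t^2
f_from_r2 <- r2 / (1 - r2) * df_resid
f_from_t  <- t_value^2

# Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - 2), where n - 2 = df_resid
adj_r2 <- 1 - (1 - r2) * (df_resid + 1) / df_resid

# Upper-tail probability of the F distribution with (1, df_resid) degrees of freedom
p_value <- pf(f_from_r2, 1, df_resid, lower.tail = FALSE)

print(c(F_from_R2 = f_from_r2, F_from_t = f_from_t, adj_R2 = adj_r2, p = p_value))
```

This also explains why the coefficient's p-value and the model's p-value are identical here: with one predictor, the t test and the F test are the same test.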

7.2 Linear Regression Model~ pat_rate

Second, we use pat_rate as a predictor to build a linear regression model for rev_rate.

7.2.1 Building LR model ~pat_rate

After omitting outliers, we build a linear regression on pat_rate.

A simple linear regression analysis indicates a dependence between patent growth rate and revenue growth rate, as the correlation analysis suggested. The plot shows the linear model as follows.

lm_pat_rate <- lm(data = top9_cleanData,rev_rate~pat_rate)

ggplot(top9_cleanData, aes(pat_rate, rev_rate)) +
  ggtitle("Linear Regression")+
  geom_point() +
  geom_smooth(method = "lm")

7.2.2 Model Evaluation

We also check the following criteria.

–Q1.Is the model statistically significant? (F statistics)

–Q2.Are the coefficients significant? (t statistics and p-values)

–Q3.Is the model useful? (R2 value)

Coefficients:

             Estimate  Std.Error  t value  Pr(>|t|)
(Intercept)  0.01590   0.02007    0.792    0.4348
pat_rate     0.20323   0.09259    2.195    0.0366 *

Residual standard error: 0.09379 on 28 degrees of freedom

Multiple R-squared: 0.1468, Adjusted R-squared: 0.1163

F-statistic: 4.818 on 1 and 28 DF, p-value: 0.03663

–Q5.Does the data satisfy the assumptions?

Normality

shapiro.test(lm_pat_rate$residual)
## 
##  Shapiro-Wilk normality test
## 
## data:  lm_pat_rate$residual
## W = 0.93497, p-value = 0.06663

Independence

durbinWatsonTest(lm_pat_rate) 
##  lag Autocorrelation D-W Statistic p-value
##    1     0.003825231      1.834223   0.616
##  Alternative hypothesis: rho != 0

A Durbin-Watson statistic in the range of 1.5 to 2.5 is commonly regarded as showing no serious autocorrelation; the value here is 1.83.

Homogeneity of Variance

ncvTest(lm_pat_rate)
## Non-constant Variance Score Test 
## Variance formula: ~ fitted.values 
## Chisquare = 0.0158844, Df = 1, p = 0.89971

For all three tests the p-value is larger than 0.05, so the assumptions of normality, independence, and constant variance of the residuals are not rejected.
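
To make the independence check less of a black box, the Durbin-Watson statistic can be computed by hand from the residuals as the sum of squared successive differences divided by the sum of squared residuals. The sketch below uses synthetic data and base R only; it illustrates the formula and does not reproduce the values reported above.

```r
set.seed(1)  # synthetic data, for illustration only
x   <- runif(30)
y   <- 0.02 + 0.2 * x + rnorm(30, sd = 0.1)
fit <- lm(y ~ x)

e <- residuals(fit)

# Durbin-Watson statistic: sum((e_t - e_{t-1})^2) / sum(e_t^2)
# Values near 2 suggest no first-order autocorrelation; the range is 0 to 4.
dw <- sum(diff(e)^2) / sum(e^2)
print(dw)

# Normality of the same residuals, using the base R Shapiro-Wilk test
print(shapiro.test(e)$p.value)
```

Note that the statistic depends on the order of the rows, so it is most meaningful when observations are sorted in a natural sequence such as time.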

–Q4.Does the model fit the data well? (residual plots)

The statistical criteria and plots show that this model is acceptable and that there is no strong evidence to reject our hypothesis.


8 Model Testing

8.1 Prediction and Test

Our training data covers 2011-2015, so we use later data as testing data. After collecting the patent growth rate and actual revenue for 2017, we run a t-test to compare the predictions made by our model with the actual values.

8.1.1 T-test

Comparing our prediction with actual revenue growth:

predict_rev = data.frame(lm_predict = predict(lm_pat_rate,newdata = test_data),
                         pat_rate =test_data$pat_rate,
                         rev_rate = testing$rev_rate)
t.test(predict_rev$rev_rate, predict_rev$lm_predict)
## 
##  Welch Two Sample t-test
## 
## data:  predict_rev$rev_rate and predict_rev$lm_predict
## t = 0.46001, df = 6.5616, p-value = 0.6604
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.07396614  0.10909785
## sample estimates:
##  mean of x  mean of y 
## 0.03039091 0.01282505

8.1.2 Scatter plots & MSE

p1 = ggplot(top9_cleanData, aes(pat_rate, rev_rate)) +
  ggtitle("Prediction for 2017 Revenue Growth Rate-LM")+
  geom_point() +
  geom_smooth(method = "lm")+
  geom_point(data=testing, colour="red", size=3)
ggplotly(p1)
MSE_LR = mean((testing$rev_rate - predict_rev$lm_predict)^2)
MSE_LR 
## [1] 0.00992566
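
The evaluation step above follows a common pattern: fit on training data, predict on held-out data, then average the squared errors. A self-contained sketch of that pattern with synthetic data is given below; the data frames `train` and `test` are hypothetical stand-ins, not the project's data. Taking the square root of the MSE reported above gives an RMSE of about 0.0996, in the same units as rev_rate.

```r
set.seed(42)  # synthetic data standing in for the training and testing sets
train <- data.frame(pat_rate = runif(30, -0.2, 0.6))
train$rev_rate <- 0.02 + 0.2 * train$pat_rate + rnorm(30, sd = 0.09)
test <- data.frame(pat_rate = runif(9, -0.2, 0.6))
test$rev_rate <- 0.02 + 0.2 * test$pat_rate + rnorm(9, sd = 0.09)

# Fit on the training set, predict on the held-out set
fit  <- lm(rev_rate ~ pat_rate, data = train)
pred <- predict(fit, newdata = test)

# Mean squared error and its square root (RMSE, same units as rev_rate)
mse  <- mean((test$rev_rate - pred)^2)
rmse <- sqrt(mse)
print(c(MSE = mse, RMSE = rmse))

# RMSE implied by the MSE reported above
print(sqrt(0.00992566))
```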

9 Conclusion

The total and the growth rate of applications have no significant impact on revenue growth.

The total and the growth rate of patent grants have a more significant influence on revenue growth.

Furthermore, comparing the two patent grant variables, the influence of the growth rate (p = 0.0366) is more evident than that of the total (p = 0.0585).


10 Limitation & Improvement

The scope of this research is limited. Only the top 9 companies in the patent grant ranking are investigated.

The number of observations is not sufficient. More data is required to make the models more convincing.

The observation period is limited. Only data before 2017 is collected and analyzed.